Kwanza Tukule Data Analysis¶

This report showcases the completion of the Kwanza Tukule Data Analyst Assessment, designed to evaluate my technical, analytical, and problem-solving skills. Using the provided anonymized sales dataset, I applied data cleaning, preparation, and exploratory analysis techniques to uncover actionable insights. The analysis includes identifying trends, customer segmentation, forecasting, and anomaly detection to address key business challenges. Strategic recommendations are derived from the findings, focusing on product performance, customer retention, and operational efficiency. Additionally, I developed a dashboard to visualize key metrics, enabling better decision-making. This report demonstrates my ability to analyze data effectively and present meaningful insights in a structured, professional manner.

Importing Libraries¶

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
from matplotlib import pyplot as plt
%matplotlib inline


#html export
import plotly.io as pio
pio.renderers.default = 'notebook'

Loading Dataset¶

In [3]:
df = pd.read_excel(r'Case Study Data - Read Only.xlsx')
df
Out[3]:
DATE ANONYMIZED CATEGORY ANONYMIZED PRODUCT ANONYMIZED BUSINESS ANONYMIZED LOCATION QUANTITY UNIT PRICE
0 2024-08-18 21:32:00 Category-106 Product-21f4 Business-de42 Location-1ba8 1 850.0
1 2024-08-18 21:32:00 Category-120 Product-4156 Business-de42 Location-1ba8 2 1910.0
2 2024-08-18 21:32:00 Category-121 Product-49bd Business-de42 Location-1ba8 1 3670.0
3 2024-08-18 21:32:00 Category-76 Product-61dd Business-de42 Location-1ba8 1 2605.0
4 2024-08-18 21:32:00 Category-119 Product-66e0 Business-de42 Location-1ba8 5 1480.0
... ... ... ... ... ... ... ...
333400 2024-01-09 20:49:00 Category-119 Product-e98d Business-f9ff Location-1979 1 1770.0
333401 2024-08-20 20:24:00 Category-106 Product-21f4 Business-72bc Location-689f 1 850.0
333402 2024-08-20 20:24:00 Category-76 Product-6e9c Business-72bc Location-689f 1 2575.0
333403 2024-10-10 18:19:00 Category-91 Product-523e Business-2693 Location-b27b 1 1520.0
333404 2024-10-10 18:19:00 Category-75 Product-b31e Business-2693 Location-b27b 1 4420.0

333405 rows × 7 columns

Attributes¶

  1. DATE: Represents the date on which the transaction or activity occurred.

  2. ANONYMIZED CATEGORY: Categorization of the product or service involved in the transaction, with the actual category names replaced by anonymized labels.

  3. ANONYMIZED PRODUCT: Refers to the specific product associated with the transaction, with product names anonymized for confidentiality.

  4. ANONYMIZED BUSINESS: Represents the business entity involved in the transaction.

  5. ANONYMIZED LOCATION: The location where the transaction or activity occurred.

  6. QUANTITY: The quantity of the product involved in the transaction, recorded as an integer.

  7. UNIT PRICE: The price per unit of the product.

Data Cleaning and Preparation¶

In [4]:
# make column names and values uniform
df.columns = df.columns.str.lower().str.replace(' ', '_')

categorical_columns = df.dtypes[df.dtypes == 'object'].index
for c in categorical_columns:
    df[c] = df[c].str.lower().str.replace(' ', '_')
In [5]:
df
Out[5]:
date anonymized_category anonymized_product anonymized_business anonymized_location quantity unit_price
0 2024-08-18 21:32:00 category-106 product-21f4 business-de42 location-1ba8 1 850.0
1 2024-08-18 21:32:00 category-120 product-4156 business-de42 location-1ba8 2 1910.0
2 2024-08-18 21:32:00 category-121 product-49bd business-de42 location-1ba8 1 3670.0
3 2024-08-18 21:32:00 category-76 product-61dd business-de42 location-1ba8 1 2605.0
4 2024-08-18 21:32:00 category-119 product-66e0 business-de42 location-1ba8 5 1480.0
... ... ... ... ... ... ... ...
333400 2024-01-09 20:49:00 category-119 product-e98d business-f9ff location-1979 1 1770.0
333401 2024-08-20 20:24:00 category-106 product-21f4 business-72bc location-689f 1 850.0
333402 2024-08-20 20:24:00 category-76 product-6e9c business-72bc location-689f 1 2575.0
333403 2024-10-10 18:19:00 category-91 product-523e business-2693 location-b27b 1 1520.0
333404 2024-10-10 18:19:00 category-75 product-b31e business-2693 location-b27b 1 4420.0

333405 rows × 7 columns

In [6]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 333405 entries, 0 to 333404
Data columns (total 7 columns):
 #   Column               Non-Null Count   Dtype         
---  ------               --------------   -----         
 0   date                 333405 non-null  datetime64[ns]
 1   anonymized_category  333405 non-null  object        
 2   anonymized_product   333405 non-null  object        
 3   anonymized_business  333405 non-null  object        
 4   anonymized_location  333405 non-null  object        
 5   quantity             333405 non-null  int64         
 6   unit_price           333397 non-null  float64       
dtypes: datetime64[ns](1), float64(1), int64(1), object(4)
memory usage: 17.8+ MB
In [7]:
# duplicates
df.duplicated().sum()
Out[7]:
3524

There are 3,524 duplicate rows; we drop them so each record represents a unique transaction.

In [8]:
#dropping duplicates
df = df.drop_duplicates()
In [9]:
df.isna().sum()
Out[9]:
date                   0
anonymized_category    0
anonymized_product     0
anonymized_business    0
anonymized_location    0
quantity               0
unit_price             8
dtype: int64

The 'unit_price' column has 8 null values.

In [10]:
df[df.unit_price.isna()]
Out[10]:
date anonymized_category anonymized_product anonymized_business anonymized_location quantity unit_price
108112 2024-07-03 17:53:00 category-94 product-3d7f business-4fce location-f37d 2 NaN
150961 2024-12-16 18:33:00 category-79 product-dfc8 business-8bbf location-3fc0 1 NaN
151142 2024-12-22 14:42:00 category-122 product-15e0 business-c575 location-1979 3 NaN
272379 2024-06-27 12:15:00 category-92 product-ccbc business-14b6 location-1979 1 NaN
278284 2024-08-14 21:09:00 category-101 product-84a5 business-4be1 location-bb69 21 NaN
278384 2024-12-30 14:17:00 category-95 product-15f3 business-1a74 location-f37d 1 NaN
310385 2024-03-31 14:03:00 category-114 product-9204 business-c9dc location-689f 1 NaN
327152 2024-08-13 16:20:00 category-107 product-7eed business-0d61 location-1ba8 1 NaN

Let's check whether 'anonymized_product' alone determines the unit price; if so, we could fill the nulls from other rows of the same product.

In [11]:
df_product_3d7fdf = df[df['anonymized_product'] == 'product-3d7f']
df_product_3d7fdf
Out[11]:
date anonymized_category anonymized_product anonymized_business anonymized_location quantity unit_price
196 2024-09-25 13:05:00 category-94 product-3d7f business-cdc1 location-e2f8 1 860.0
287 2024-11-13 19:36:00 category-94 product-3d7f business-7f77 location-8959 1 860.0
319 2024-11-01 16:54:00 category-94 product-3d7f business-2b91 location-3fc0 1 860.0
372 2024-07-11 11:20:00 category-94 product-3d7f business-2b24 location-66f4 5 875.0
447 2024-12-09 19:35:00 category-94 product-3d7f business-1b52 location-7f37 1 885.0
... ... ... ... ... ... ... ...
332771 2024-10-15 20:01:00 category-94 product-3d7f business-0ea3 location-4ea1 1 860.0
332802 2024-10-21 19:04:00 category-94 product-3d7f business-5415 location-128a 1 860.0
332842 2024-09-20 13:43:00 category-94 product-3d7f business-5760 location-689f 1 860.0
333002 2024-10-08 17:15:00 category-94 product-3d7f business-8603 location-b27b 1 860.0
333188 2024-11-08 21:20:00 category-94 product-3d7f business-8bbf location-3fc0 1 860.0

3164 rows × 7 columns

In [12]:
df_product_3d7fdf['unit_price'].unique()
Out[12]:
array([860., 875., 885., 870.,  nan])

For 'product-3d7f' there are four distinct unit prices: 860.0, 875.0, 885.0 and 870.0.

In [13]:
df_product_3d7fdf[df_product_3d7fdf['unit_price'] == 860.0]
Out[13]:
date anonymized_category anonymized_product anonymized_business anonymized_location quantity unit_price
196 2024-09-25 13:05:00 category-94 product-3d7f business-cdc1 location-e2f8 1 860.0
287 2024-11-13 19:36:00 category-94 product-3d7f business-7f77 location-8959 1 860.0
319 2024-11-01 16:54:00 category-94 product-3d7f business-2b91 location-3fc0 1 860.0
565 2024-09-15 15:52:00 category-94 product-3d7f business-f749 location-3fc0 2 860.0
780 2024-10-03 15:08:00 category-94 product-3d7f business-cdc1 location-e2f8 1 860.0
... ... ... ... ... ... ... ...
332771 2024-10-15 20:01:00 category-94 product-3d7f business-0ea3 location-4ea1 1 860.0
332802 2024-10-21 19:04:00 category-94 product-3d7f business-5415 location-128a 1 860.0
332842 2024-09-20 13:43:00 category-94 product-3d7f business-5760 location-689f 1 860.0
333002 2024-10-08 17:15:00 category-94 product-3d7f business-8603 location-b27b 1 860.0
333188 2024-11-08 21:20:00 category-94 product-3d7f business-8bbf location-3fc0 1 860.0

2344 rows × 7 columns

Let's check whether the anonymized category explains the 860.0 price.

In [14]:
df_product_3d7fdf[df_product_3d7fdf['anonymized_category'] == 'category-94']
Out[14]:
date anonymized_category anonymized_product anonymized_business anonymized_location quantity unit_price
196 2024-09-25 13:05:00 category-94 product-3d7f business-cdc1 location-e2f8 1 860.0
287 2024-11-13 19:36:00 category-94 product-3d7f business-7f77 location-8959 1 860.0
319 2024-11-01 16:54:00 category-94 product-3d7f business-2b91 location-3fc0 1 860.0
372 2024-07-11 11:20:00 category-94 product-3d7f business-2b24 location-66f4 5 875.0
447 2024-12-09 19:35:00 category-94 product-3d7f business-1b52 location-7f37 1 885.0
... ... ... ... ... ... ... ...
332771 2024-10-15 20:01:00 category-94 product-3d7f business-0ea3 location-4ea1 1 860.0
332802 2024-10-21 19:04:00 category-94 product-3d7f business-5415 location-128a 1 860.0
332842 2024-09-20 13:43:00 category-94 product-3d7f business-5760 location-689f 1 860.0
333002 2024-10-08 17:15:00 category-94 product-3d7f business-8603 location-b27b 1 860.0
333188 2024-11-08 21:20:00 category-94 product-3d7f business-8bbf location-3fc0 1 860.0

3164 rows × 7 columns

Even after restricting 'product-3d7f' to 'category-94', the unit price still varies, so there is no attribute we can reliably impute from.

Hence, with only 8 affected rows out of roughly 330k, it is safe to drop the nulls.
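This check can also be run across all products at once by counting distinct unit prices per product; any product with more than one price has no single safe fill value. A minimal sketch on a toy frame with made-up products and prices:

```python
import pandas as pd

# Toy frame mirroring the dataset's schema (values are illustrative, not real).
toy = pd.DataFrame({
    "anonymized_product": ["p1", "p1", "p1", "p2", "p2"],
    "unit_price": [860.0, 875.0, None, 500.0, 500.0],
})

# Count distinct non-null prices per product; products with more than one
# price cannot be imputed unambiguously from the product alone.
price_counts = toy.groupby("anonymized_product")["unit_price"].nunique()
ambiguous = price_counts[price_counts > 1]
print(ambiguous)  # only p1 appears: two distinct prices, no safe fill value
```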

In [15]:
# dropping nulls
df = df.dropna()
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 329873 entries, 0 to 333404
Data columns (total 7 columns):
 #   Column               Non-Null Count   Dtype         
---  ------               --------------   -----         
 0   date                 329873 non-null  datetime64[ns]
 1   anonymized_category  329873 non-null  object        
 2   anonymized_product   329873 non-null  object        
 3   anonymized_business  329873 non-null  object        
 4   anonymized_location  329873 non-null  object        
 5   quantity             329873 non-null  int64         
 6   unit_price           329873 non-null  float64       
dtypes: datetime64[ns](1), float64(1), int64(1), object(4)
memory usage: 20.1+ MB

Summary of Issues Identified and Steps Taken to Resolve Them:¶

  1. Duplicate Records:
  • Issue Identified: The dataset initially contained 3524 duplicate records, which could skew the analysis and affect the accuracy of any insights derived from the data.
  • Steps Taken: The duplicates were identified and dropped from the dataset using the .duplicated().sum() method and .drop_duplicates() to ensure each record is unique and accurately represents individual sales transactions.
  2. Null Values in 'unit_price' Column:
  • Issue Identified: There were 8 null values in the unit_price column, which is critical as the price of the product must be available for proper calculations of sales value and other analyses.
  • Steps Taken: The null values in the unit_price column were identified using .isnull().sum() and subsequently dropped from the dataset using .dropna() to remove any incomplete records that might lead to errors in subsequent analysis.

Final Dataset: After addressing these issues, the dataset now contains 329,873 non-null records, with no duplicate entries and all null values removed from the unit_price column. The dataset's columns now have consistent data types and no missing or redundant data, making it ready for further analysis.
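The cleaning steps above can be locked in with a few sanity assertions so regressions are caught early. A minimal sketch, using a small toy frame standing in for the cleaned dataset:

```python
import pandas as pd

# Toy cleaned frame standing in for df (values are illustrative, not real).
df_clean = pd.DataFrame({
    "quantity": [1, 2, 5],
    "unit_price": [850.0, 1910.0, 1480.0],
}).drop_duplicates().dropna()

# Post-cleaning sanity checks: no duplicates, no nulls,
# and strictly positive quantities and prices.
assert df_clean.duplicated().sum() == 0
assert df_clean.isna().sum().sum() == 0
assert (df_clean["quantity"] > 0).all() and (df_clean["unit_price"] > 0).all()
print("cleaning checks passed")
```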

Feature Engineering¶

Creating a “Month-Year” column (e.g., August 2024) from the “DATE” column.

In [16]:
df
Out[16]:
date anonymized_category anonymized_product anonymized_business anonymized_location quantity unit_price
0 2024-08-18 21:32:00 category-106 product-21f4 business-de42 location-1ba8 1 850.0
1 2024-08-18 21:32:00 category-120 product-4156 business-de42 location-1ba8 2 1910.0
2 2024-08-18 21:32:00 category-121 product-49bd business-de42 location-1ba8 1 3670.0
3 2024-08-18 21:32:00 category-76 product-61dd business-de42 location-1ba8 1 2605.0
4 2024-08-18 21:32:00 category-119 product-66e0 business-de42 location-1ba8 5 1480.0
... ... ... ... ... ... ... ...
333398 2024-11-13 13:59:00 category-121 product-898d business-0e99 location-689f 5 1360.0
333401 2024-08-20 20:24:00 category-106 product-21f4 business-72bc location-689f 1 850.0
333402 2024-08-20 20:24:00 category-76 product-6e9c business-72bc location-689f 1 2575.0
333403 2024-10-10 18:19:00 category-91 product-523e business-2693 location-b27b 1 1520.0
333404 2024-10-10 18:19:00 category-75 product-b31e business-2693 location-b27b 1 4420.0

329873 rows × 7 columns

In [17]:
# 'date' is already a datetime object, so no parsing is needed

# ------------Create the 'Month-Year' column---------------#
df2 = df.copy()  # work on a copy to preserve the cleaned df

# Normalize each date to the first of its month
df2['Month-Year'] = df2['date'].dt.to_period('M').dt.to_timestamp()

# Format for display (e.g. 'August 2024'); note this converts the column to strings
df2['Month-Year'] = df2['Month-Year'].dt.strftime('%B %Y')

# Drop the 'date' column
df2.drop(columns=['date'], inplace=True)

# Move 'Month-Year' column to the first position
columns = ['Month-Year'] + [col for col in df2.columns if col != 'Month-Year']
df2 = df2[columns]

# Display the updated DataFrame to check the new column
df2
Out[17]:
Month-Year anonymized_category anonymized_product anonymized_business anonymized_location quantity unit_price
0 August 2024 category-106 product-21f4 business-de42 location-1ba8 1 850.0
1 August 2024 category-120 product-4156 business-de42 location-1ba8 2 1910.0
2 August 2024 category-121 product-49bd business-de42 location-1ba8 1 3670.0
3 August 2024 category-76 product-61dd business-de42 location-1ba8 1 2605.0
4 August 2024 category-119 product-66e0 business-de42 location-1ba8 5 1480.0
... ... ... ... ... ... ... ...
333398 November 2024 category-121 product-898d business-0e99 location-689f 5 1360.0
333401 August 2024 category-106 product-21f4 business-72bc location-689f 1 850.0
333402 August 2024 category-76 product-6e9c business-72bc location-689f 1 2575.0
333403 October 2024 category-91 product-523e business-2693 location-b27b 1 1520.0
333404 October 2024 category-75 product-b31e business-2693 location-b27b 1 4420.0

329873 rows × 7 columns

EDA (Exploratory Data Analysis)¶

Sales Overview¶

In [18]:
# Calculate the 'Value' column (quantity × unit_price)
df3 = df2.copy()
df3['Value'] = df3['quantity'] * df3['unit_price']
df3
Out[18]:
Month-Year anonymized_category anonymized_product anonymized_business anonymized_location quantity unit_price Value
0 August 2024 category-106 product-21f4 business-de42 location-1ba8 1 850.0 850.0
1 August 2024 category-120 product-4156 business-de42 location-1ba8 2 1910.0 3820.0
2 August 2024 category-121 product-49bd business-de42 location-1ba8 1 3670.0 3670.0
3 August 2024 category-76 product-61dd business-de42 location-1ba8 1 2605.0 2605.0
4 August 2024 category-119 product-66e0 business-de42 location-1ba8 5 1480.0 7400.0
... ... ... ... ... ... ... ... ...
333398 November 2024 category-121 product-898d business-0e99 location-689f 5 1360.0 6800.0
333401 August 2024 category-106 product-21f4 business-72bc location-689f 1 850.0 850.0
333402 August 2024 category-76 product-6e9c business-72bc location-689f 1 2575.0 2575.0
333403 October 2024 category-91 product-523e business-2693 location-b27b 1 1520.0 1520.0
333404 October 2024 category-75 product-b31e business-2693 location-b27b 1 4420.0 4420.0

329873 rows × 8 columns

Calculating the sales for each unique category

In [19]:
# Group by anonymized_category
category_sales = df3.groupby('anonymized_category').agg(
    Total_Quantity=('quantity', 'sum'),
    Total_Value=('Value', 'sum')
).reset_index()

category_sales
Out[19]:
anonymized_category Total_Quantity Total_Value
0 category-100 76824 134902751.0
1 category-101 19564 35577822.0
2 category-102 1786 464463.0
3 category-104 1217 1557598.0
4 category-105 1579 2690719.0
5 category-106 6521 5932763.0
6 category-107 2729 4170797.0
7 category-108 9756 5101375.0
8 category-109 1446 1263226.0
9 category-110 10529 5483386.0
10 category-111 6715 4387343.0
11 category-113 741 1254083.0
12 category-114 3 8600.0
13 category-115 348 425360.0
14 category-116 856 422745.0
15 category-117 5 1550.0
16 category-118 21 7560.0
17 category-119 68332 103454819.0
18 category-120 169715 319178743.0
19 category-121 14669 22327643.0
20 category-122 1223 3493480.0
21 category-123 286 730730.0
22 category-124 4 10060.0
23 category-125 123 297060.0
24 category-74 941 1927871.0
25 category-75 151330 544658700.0
26 category-76 71719 344939553.0
27 category-77 28455 76741382.0
28 category-78 9766 9792609.0
29 category-79 2215 1184953.0
30 category-81 142 72061.0
31 category-82 4759 3930818.0
32 category-83 2436 4039483.0
33 category-84 11933 6798158.0
34 category-85 22997 33762533.0
35 category-86 8 3320.0
36 category-89 238 136850.0
37 category-90 15 15750.0
38 category-91 20853 44152103.0
39 category-92 6953 10468723.0
40 category-94 23668 16750815.0
41 category-95 4116 7466932.0
42 category-96 1427 2249424.0
43 category-97 2711 2628309.0
44 category-98 2152 2519695.0
45 category-99 1964 1589480.0
In [ ]:
# graphing total quantity and total value per anonymized category
import plotly.graph_objects as go

# bar chart for Total Quantity
bar = go.Bar(
    x=category_sales["anonymized_category"],
    y=category_sales["Total_Quantity"],
    name="Total Quantity",
    marker=dict(color="steelblue")
)

# line chart for Total Value
line = go.Scatter(
    x=category_sales["anonymized_category"],
    y=category_sales["Total_Value"],
    name="Total Value",
    mode="lines+markers",  # Line with markers
    line=dict(color="firebrick", width=3),  # Line color and width
    marker=dict(size=8),  # Marker size
    yaxis="y2"  # Assign to secondary y-axis
)

# Combine both traces
fig = go.Figure(data=[bar, line])

# Update layout for dual y-axes
fig.update_layout(
    title="Sales Overview by Category",
    xaxis_title="Category",
    yaxis=dict(
        title="Total Quantity",
        titlefont=dict(color="steelblue"),  # Match bar chart color
        tickfont=dict(color="steelblue")  # Match bar chart color
    ),
    yaxis2=dict(
        title="Total Value",
        titlefont=dict(color="firebrick"),  # Match line chart color
        tickfont=dict(color="firebrick"),  # Match line chart color
        overlaying="y",  # Overlay on the same plot
        side="right"  # Position on the right
    ),
    legend_title="Metrics",
    barmode="group",  # Group bars together
    template="plotly_white"  # Cleaner theme
)

# Show the figure
fig.show()

∴ From the combined bar-and-line plot, 'category-120' leads in sales volume, with over 169.7k units sold.
By revenue, however, 'category-75' leads, with over 544.6M in total value.
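These two leaders can also be confirmed programmatically with idxmax rather than read off the chart. A sketch on a toy frame seeded with the three aggregate rows shown in the category_sales table above:

```python
import pandas as pd

# Toy aggregate mirroring category_sales' columns
# (three rows copied from the table above; the rest are omitted).
cat = pd.DataFrame({
    "anonymized_category": ["category-75", "category-120", "category-76"],
    "Total_Quantity": [151330, 169715, 71719],
    "Total_Value": [544658700.0, 319178743.0, 344939553.0],
})

# The leader can differ by metric: idxmax locates each top row.
top_by_qty = cat.loc[cat["Total_Quantity"].idxmax(), "anonymized_category"]
top_by_value = cat.loc[cat["Total_Value"].idxmax(), "anonymized_category"]
print(top_by_qty, top_by_value)  # category-120 category-75
```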

Calculating the sales for each unique business.

In [21]:
# Group by anonymized_business
business_sales = df3.groupby('anonymized_business').agg(
    Total_Quantity=('quantity', 'sum'),
    Total_Value=('Value', 'sum')
).reset_index()

business_sales
Out[21]:
anonymized_business Total_Quantity Total_Value
0 business-0000 8 10445.0
1 business-0005 1 2645.0
2 business-0029 26 77340.0
3 business-003d 98 221761.0
4 business-0072 127 225056.0
... ... ... ...
4795 business-ffa9 3 6740.0
4796 business-ffae 6 10530.0
4797 business-ffb1 266 438115.0
4798 business-ffd2 37 67723.0
4799 business-ffff 110 110285.0

4800 rows × 3 columns

For the per-business comparison, two separate charts are clearer given the large number of businesses plotted.
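With roughly 4,800 businesses, an alternative worth considering is narrowing each chart to the leaders before plotting. A minimal sketch on a hypothetical toy stand-in for business_sales:

```python
import pandas as pd

# Toy stand-in for business_sales (ids and values are illustrative).
biz = pd.DataFrame({
    "anonymized_business": [f"business-{i:04d}" for i in range(20)],
    "Total_Value": [float(v) for v in range(100, 2100, 100)],
})

# Keep only the top 10 by value so the chart stays readable.
top10 = biz.nlargest(10, "Total_Value")
print(top10["anonymized_business"].tolist())
```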

In [22]:
# Line chart for Total Value (green)
line_value = go.Scatter(
    x=business_sales["anonymized_business"],
    y=business_sales["Total_Value"],
    name="Total Value",
    mode="lines",
    line=dict(color="green"),
)

# layout
layout_value = go.Layout(
    title="Total Value by Business",
    xaxis_title="Business",
    yaxis_title="Total Value",
    template="plotly_white",
)

# figure
fig_value = go.Figure(data=[line_value], layout=layout_value)

# Show the plot
fig_value.show()

∴ From the line plot, 'business-978e' generates the highest revenue, at over 28M.

In [23]:
# Line chart for Total Quantity (blue)
line_quantity = go.Scatter(
    x=business_sales["anonymized_business"],
    y=business_sales["Total_Quantity"],
    name="Total Quantity",
    mode="lines",
    line=dict(color="blue"),
)

# layout
layout_quantity = go.Layout(
    title="Total Quantity by Business",
    xaxis_title="Business",
    yaxis_title="Total Quantity",
    template="plotly_white",
)

# figure
fig_quantity = go.Figure(data=[line_quantity], layout=layout_quantity)

# Show the plot
fig_quantity.show()

∴ From the line plot, 'business-978e' also leads in volume, with over 13.9k units sold.

Trends Over Time¶

To create a time series plot we need a datetime variable

In [24]:
# Create a copy of df
df4 = df.copy()

# Add 'Value' column
df4['value'] = df4['quantity'] * df4['unit_price']

# Create a new column 'Month-Year' in df4 (not df)
df4['month-year'] = df4['date'].dt.to_period('M').dt.to_timestamp()

# Format 'Month-Year' to display as "August 2024"
df4['month-year-formatted'] = df4['month-year'].dt.strftime('%B %Y')

# Reorder columns to make 'Month-Year-Formatted' the first column
df4 = df4[['month-year-formatted'] + [col for col in df4.columns if col != 'month-year-formatted']]
In [25]:
df4.info()
<class 'pandas.core.frame.DataFrame'>
Index: 329873 entries, 0 to 333404
Data columns (total 10 columns):
 #   Column                Non-Null Count   Dtype         
---  ------                --------------   -----         
 0   month-year-formatted  329873 non-null  object        
 1   date                  329873 non-null  datetime64[ns]
 2   anonymized_category   329873 non-null  object        
 3   anonymized_product    329873 non-null  object        
 4   anonymized_business   329873 non-null  object        
 5   anonymized_location   329873 non-null  object        
 6   quantity              329873 non-null  int64         
 7   unit_price            329873 non-null  float64       
 8   value                 329873 non-null  float64       
 9   month-year            329873 non-null  datetime64[ns]
dtypes: datetime64[ns](2), float64(2), int64(1), object(5)
memory usage: 27.7+ MB

Each date is normalized to the first of its month for aggregation; for instance, any date in January 2024 is recorded as January 1, 2024.
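A quick check of that first-of-month convention, assuming pandas' to_period/to_timestamp behavior:

```python
import pandas as pd

# Two different January timestamps (illustrative) collapse to the same
# month start under to_period('M').to_timestamp().
d = pd.Series(pd.to_datetime(["2024-01-09 20:49", "2024-01-31 23:59"]))
firsts = d.dt.to_period("M").dt.to_timestamp()
print(firsts.nunique())  # 1 -> both dates map to 2024-01-01
```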

In [26]:
# Group by 'month-year' and calculate the monthly averages of quantity and value
monthly_avg_sales = df4.groupby('month-year').agg({'quantity': 'mean', 'value': 'mean'}).reset_index()

# Convert 'month-year' to a string for plotting
monthly_avg_sales['month-year'] = monthly_avg_sales['month-year'].astype(str)

# Create a line plot with dual y-axes
fig = go.Figure()

# Add the line for Value
fig.add_trace(go.Scatter(
    x=monthly_avg_sales['month-year'], 
    y=monthly_avg_sales['value'], 
    name='Average Value', 
    line=dict(color='red'),
    yaxis='y1'
))

# Add the line for Quantity
fig.add_trace(go.Scatter(
    x=monthly_avg_sales['month-year'], 
    y=monthly_avg_sales['quantity'], 
    name='Average Quantity', 
    line=dict(color='green'),
    yaxis='y2'
))

# Update layout for dual y-axes
fig.update_layout(
    title="Monthly Averages of Sales (Quantity and Value)",
    xaxis=dict(title="Month-Year"),
    yaxis=dict(
        title="Average Value",
        titlefont=dict(color="red"),  # match the Value line color
        tickfont=dict(color="red"),
    ),
    yaxis2=dict(
        title="Average Quantity",
        titlefont=dict(color="green"),
        tickfont=dict(color="green"),
        anchor="x",
        overlaying="y",
        side="right",
    ),
    legend=dict(x=0.5, y=1.1, orientation="h"),
)

# Show the plot
fig.show()

Monthly averages capture typical per-transaction performance each month, which is useful when there are outliers or variations in the data.

Both average quantity and average value slope downward, indicating a decreasing trend in sales over time.
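The downward slope can be quantified with month-over-month percentage changes rather than judged by eye. A minimal sketch on illustrative monthly averages (not the real figures):

```python
import pandas as pd

# Illustrative monthly averages; pct_change quantifies the slope the chart
# shows, and a run of negative values confirms a declining trend.
monthly = pd.Series(
    [120.0, 110.0, 105.0, 98.0],
    index=pd.period_range("2024-01", periods=4, freq="M"),
)
mom = monthly.pct_change().dropna()
print((mom < 0).all())  # True -> every month fell versus the previous one
```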

In [ ]:
# Group by 'month-year' and calculate the monthly sums of quantity and value
monthly_sum_sales = df4.groupby('month-year').agg({'quantity': 'sum', 'value': 'sum'}).reset_index()

# Convert 'month-year' to a string for plotting
monthly_sum_sales['month-year'] = monthly_sum_sales['month-year'].astype(str)

# Create a line plot with dual y-axes
fig = go.Figure()

# Add the line for Value
fig.add_trace(go.Scatter(
    x=monthly_sum_sales['month-year'], 
    y=monthly_sum_sales['value'], 
    name='Total Sum Value', 
    line=dict(color='red'),
    yaxis='y1'
))

# Add the line for Quantity
fig.add_trace(go.Scatter(
    x=monthly_sum_sales['month-year'], 
    y=monthly_sum_sales['quantity'], 
    name='Total Sum Quantity', 
    line=dict(color='green'),
    yaxis='y2'
))

# Update layout for dual y-axes
fig.update_layout(
    title="Monthly Sum of Sales (Quantity and Value)",
    xaxis=dict(title="Month-Year"),
    yaxis=dict(
        title="Total Value",
        titlefont=dict(color="red"),  # match the Value line color
        tickfont=dict(color="red"),
    ),
    yaxis2=dict(
        title="Total Quantity",
        titlefont=dict(color="green"),
        tickfont=dict(color="green"),
        anchor="x",
        overlaying="y",
        side="right",
    ),
    legend=dict(x=0.5, y=1.1, orientation="h"),
)

# Show the plot
fig.show()

To understand overall monthly performance, the data is aggregated by summing sales for each month.

The trend fluctuates visibly, with peaks in certain months (May, July and October), which may suggest seasonal effects.
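Peak months can also be surfaced programmatically instead of being read off the chart. A sketch on illustrative monthly totals (made-up numbers, not the real data):

```python
import pandas as pd

# Illustrative monthly totals; nlargest surfaces the peak months that the
# sum chart suggests may be seasonal.
totals = pd.Series(
    {"2024-05": 950.0, "2024-06": 700.0, "2024-07": 900.0,
     "2024-08": 650.0, "2024-09": 600.0, "2024-10": 880.0}
)
peaks = totals.nlargest(3).index.tolist()
print(peaks)  # ['2024-05', '2024-07', '2024-10']
```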

Performance analysis¶

In [28]:
# top 5 most frequently purchased products (based on Quantity).

# Group by product and sum the quantities
top_products = df4.groupby('anonymized_product')['quantity'].sum().reset_index()

# Sort in descending order and select the top 5
top_5_products = top_products.sort_values(by='quantity', ascending=False).head(5)

# Display the result
top_5_products
Out[28]:
anonymized_product quantity
338 product-66e0 46957
750 product-e805 42602
476 product-8f75 37566
127 product-29ee 35940
213 product-4156 28487
In [32]:
# Create a bar graph
fig = px.bar(
    top_5_products,
    x='anonymized_product',
    y='quantity',
    title='Top 5 Most Purchased Products (Quantity)',
    labels={'anonymized_product': 'Product', 'quantity': 'Total Quantity'},
    text='quantity'  # Display the quantity on the bars
)

# Customize the layout
fig.update_layout(
    xaxis_title='Product',
    yaxis_title='Total Quantity',
    template='plotly_white'
)

# Show the plot
fig.show()
In [33]:
# top 5 most valuable products (based on Value).

# Group by product and sum the values
top_value_products = df4.groupby('anonymized_product')['value'].sum().reset_index()

# Sort in descending order and select the top 5
top_value_products = top_value_products.sort_values(by='value', ascending=False).head(5)

# Display the result
top_value_products
Out[33]:
anonymized_product value
750 product-e805 262787281.0
476 product-8f75 158797460.0
338 product-66e0 70704225.0
127 product-29ee 68248274.0
213 product-4156 56956007.0
In [35]:
# Create a bar graph
fig = px.bar(
    top_value_products,
    x='anonymized_product',
    y='value',
    title='Top 5 Most Valuable Products (Value)',
    labels={'anonymized_product': 'Product', 'value': 'Total Value'},
    text='value'  # Display the value on the bars
)

# Customize the layout
fig.update_layout(
    xaxis_title='Product',
    yaxis_title='Total Value',
    template='plotly_white'
)

# Show the plot
fig.show()